Remove separate syntax heads for each operator #575

Open: wants to merge 2 commits into base `kf/dots`

Conversation

@Keno Keno (Member) commented Jul 11, 2025

This replaces all the specialized operator heads by a single K"Operator" head that encodes the precedence level in its flags (except for operators that are also used for non-operator purposes). The operators are already K"Identifier" in the final parse tree. There is very little reason to spend all of the extra effort separating them into separate heads only to undo this later. Moreover, I think it's actively misleading, because it makes people think that they can query things about an operator by looking at the head, which doesn't work for suffixed operators.
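As a toy sketch of the encoding idea (self-contained stand-in types; the flag layout, mask, and levels below are assumptions for illustration, not JuliaSyntax's actual representation):

```julia
# One head for all operators; the precedence level lives in flag bits.
struct Tok
    kind::Symbol     # e.g. :Operator or :Identifier
    flags::UInt16    # low nibble assumed to hold the precedence level
end

const PREC_MASK = 0x000f
prec_level(t::Tok) = Int(t.flags & PREC_MASK)

plus  = Tok(:Operator, 0x0006)   # made-up additive precedence level
times = Tok(:Operator, 0x0007)   # made-up multiplicative precedence level
@assert prec_level(times) > prec_level(plus)  # precedence is a flag query, not a kind check
```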

Additionally, this removes the `op=` token, replacing it with two tokens: one K"Operator" with a special precedence level and one `=`. This then removes the last use of `bump_split` (since this PR is on top of #573).
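The observable lexing change, expressed with the `toks` test helper from this repo's test suite (the same helper the review below suggests; a sketch, assuming the helper is in scope):

```julia
# `1+=2` now lexes `+` and `=` as two separate tokens rather than one fused
# `+=` token (token 1 is the literal `1`):
@test toks("1+=2")[2:3] == ["+" => K"Operator", "=" => K"="]
```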

As a free bonus, this prepares us for having compound assignment syntax for suffixed operators, which was infeasible in the flisp parser. That syntax change is not part of this PR but would be trivial (this PR makes it an explicit error).

Fixes #334

Unfortunately, the sequences `..` and `...` do not always refer
to the `..` operator or the `...` syntax. There are two and a half cases
where they don't:

1. After `@` in a macro call, where they are both regular identifiers
2. In `import ...A`, where the dots specify the relative import level
3. In `:(...)`, where `...` is treated as a quoted identifier

Case 1 was handled in a previous commit by lexing these as identifiers
after `@`.

However, as a result of case 2, it is problematic to tokenize these dots
together; we essentially have to untokenize them in the import parser. It
is also infeasible to give the lexer special context-sensitive lexing in
`import`, because there can be arbitrary interpolations, e.g.
`@eval import A, $(f(x..y)), ..b`, so deciding whether a particular
`..` after `import` refers to the operator or a level specifier requires
the parser.
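To make the ambiguity concrete (using the `parsestmt` API that also appears in the review below):

```julia
using JuliaSyntax
# The same two characters mean different things depending on context:
parsestmt(SyntaxNode, "import ..A")  # dots are a relative-import level specifier
parsestmt(SyntaxNode, "a .. b")      # `..` is an ordinary binary operator
```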

Currently the parser handles this by re-splitting the obtained tokens
in the import parser, but this is undesirable because it breaks the
invariant that the tokens produced by the lexer correspond to the
terminals (leaves) of the final parse tree.

This PR attempts to address this by only ever having the lexer emit
`K"."` and letting the parser decide which case it refers to.
The new non-terminal `K"dots"` handles the identifier cases (ordinary
`..` and quoted `:(...)`). `K"..."` is now used exclusively for
splat/slurp, and is no longer used in its non-terminal form for
case 3.
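Observable behavior under the new scheme (a sketch; which heads appear where is per this PR's description):

```julia
using JuliaSyntax
parsestmt(SyntaxNode, "a .. b")    # ordinary `..` operator, an identifier case
parsestmt(SyntaxNode, ":(...)")    # quoted `...`, the other identifier case (K"dots")
parsestmt(SyntaxNode, "f(x...)")   # splat, the one remaining use of K"..."
```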

codecov bot commented Jul 11, 2025

Codecov Report

Attention: Patch coverage is 98.11321% with 4 lines in your changes missing coverage. Please review.

Please upload report for BASE (kf/dots@daf52ca).

| Files with missing lines | Patch % | Lines |
|--------------------------|---------|-------|
| src/julia/kinds.jl | 85.00% | 3 Missing ⚠️ |
| src/julia/tokenize.jl | 99.12% | 1 Missing ⚠️ |
Additional details and impacted files
@@            Coverage Diff             @@
##             kf/dots     #575   +/-   ##
==========================================
  Coverage           ?   95.41%           
==========================================
  Files              ?       16           
  Lines              ?     4578           
  Branches           ?        0           
==========================================
  Hits               ?     4368           
  Misses             ?      210           
  Partials           ?        0           

@c42f c42f (Member) left a comment

Ok, so I like this a lot in overview and I think the idea is right.

But there's a fair bit to clean up in the implementation and I'm going to be honest, this took an absolute ton of time to review.

One pervasive issue I find confusing in the parser.jl code changes is that the `isassign` output of `peek_dotted_op_token()` is often ignored, but not always. For which cases is this actually ok? One practical difference between ignoring `isassign` and not ignoring it shows up in the following errors:

`isassign` checked for:

julia> parsestmt(SyntaxNode, "x |>= y")
ERROR: ParseError:
# Error @ line 1:3
x |>= y
# └┘ ── Compound assignment is not allowed for this operator

vs `isassign` not checked for:

julia> parsestmt(SyntaxNode, "x ..= y")
ERROR: ParseError:
# Error @ line 1:5
x ..= y
#   ╙ ── unexpected `=`

Was Claude or another AI tool used for the code changes?

I feel we may need a bunch more tests to check that `isassign` is used correctly.
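A sketch of such tests, based on the two errors shown above (hedged; the exact error kinds and messages are per this PR branch):

```julia
using JuliaSyntax, Test
# Both forms should be rejected; ideally the first with the operator-specific
# "Compound assignment is not allowed for this operator" message:
@test_throws JuliaSyntax.ParseError parsestmt(SyntaxNode, "x |>= y")
@test_throws JuliaSyntax.ParseError parsestmt(SyntaxNode, "x ..= y")
```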

Comment on lines +180 to +181
@test tok("1+=2", 2).kind == K"Operator" # + before =
@test tok("1+=2", 3).kind == K"="

For testing multiple tokens with the same input, I suggest `toks()`:

Suggested change
@test tok("1+=2", 2).kind == K"Operator" # + before =
@test tok("1+=2", 3).kind == K"="
@test toks("1+=2")[2:3] == ["+"=>K"Operator", "="=>K"="]

(a lot of tests for tokenize.jl were written over time with various test tooling and haven't necessarily been updated to the latest way to do these things)

@@ -1217,5 +653,5 @@ function is_syntactic_operator(x)
# in the parser? The lexer itself usually disallows such tokens, so it's
# not clear whether we need to handle them. (Though note `.->` is a
# token...)
-    return k in KSet"&& || . ... ->" || is_syntactic_assignment(k)
+    return k in KSet"&& || . ... -> = :="

With this change we now have

julia> JuliaSyntax.is_syntactic_operator(K".=")
false

Whereas it used to be true. Was this intentional?

-function is_plain_equals(t)
-    kind(t) == K"=" && !is_suffixed(t)
-end
+is_plain_equals(t) = kind(t) == K"="

Let's remove this function; it's only used in two places and the test is now trivial.

Comment on lines 427 to 431
"""
emit(l::Lexer, kind::Kind)

Returns a `RawToken` of kind `kind` with contents `str` and starts a new `RawToken`.
"""

Suggested change
"""
emit(l::Lexer, kind::Kind)
Returns a `RawToken` of kind `kind` with contents `str` and starts a new `RawToken`.
"""

wrong docstring

@@ -608,14 +600,18 @@ function parse_assignment_with_initial_ex(ps::ParseState, mark, down::T) where {
# a += b ==> (+= a b)

Comment needs fixing I guess (or delete it because it's covered in the if-else below)

emit(ps, mark, leading_dot ? K".op=" : K"op=")
if check_identifiers
# += ==> (error (op= +))
# .+= ==> (error (. (op= +)))

This is not correct

Suggested change
# .+= ==> (error (. (op= +)))
# .+= ==> (error (.op= +))

@@ -76,6 +76,8 @@ tests = [
"f(x) where S where U = 1" => "(function-= (where (where (call f x) S) U) 1)"
"(f(x)::T) where S = 1" => "(function-= (where (parens (::-i (call f x) T)) S) 1)"
"f(x) = 1 = 2" => "(function-= (call f x) (= 1 2))" # Should be a warning!
# Bad assignment with suffixed op
((v = v"1.12",), "a +₁= b") => "(op= a (error +₁) b)"

This is not a version-specific error as implemented

Suggested change
((v = v"1.12",), "a +₁= b") => "(op= a (error +₁) b)"
"a +₁= b" => "(op= a (error +₁) b)"

Comment on lines +218 to +220
tokens = tokenize("+₁")
@test length(tokens) == 1 # Just the identifier, endmarker is not included in tokenize()
@test kind(tokens[1]) == K"Identifier"

Suggested change
tokens = tokenize("+₁")
@test length(tokens) == 1 # Just the identifier, endmarker is not included in tokenize()
@test kind(tokens[1]) == K"Identifier"
@test tokensplit("+₁") == [K"Identifier"=>"+₁"]

@testset "dotted and suffixed operators" begin

for opkind in Tokenize._nondot_symbolic_operator_kinds()
for opkind in _nondot_symbolic_operator_kinds()

This seems incorrect: this test now omits many operators. It used to rely on the fact that all the operator kinds were listed individually.

Instead, I guess we should have a big list of all the allowable operators here, separate from the list in Tokenize.
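A sketch of that suggestion (the list contents are illustrative, not exhaustive; `tok` is the test helper used elsewhere in this thread):

```julia
# Explicit operator list maintained in the test file itself, independent of
# Tokenize's internal tables:
known_operators = ["+", "-", "*", "/", "^", "==", "<", "|>", "÷"]
for op in known_operators
    @test tok(op, 1).kind == K"Operator"
end
```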

is_syntactic_operator(leading_kind) ? leading_kind : K"Identifier")

# Check if this is a compound assignment operator pattern

redundant comment

Suggested change
# Check if this is a compound assignment operator pattern

@Keno Keno (Member, Author) commented Aug 10, 2025

Thanks for the extensive review; I'll wait until the other PR is merged to rebase and fix those up.

Was Claude or another AI tool used for the code changes?

I used Claude off and on for the big changeset this was extracted from (along with several of the earlier PRs). However, I think the things you flagged as most objectionable are not Claude's fault, but rather an artifact of multiple rounds of iteration and rebasing as the earlier patches in this sequence were cleaned up and landed individually.
